Biostat 212a Homework 5

Due Mar 11, 2024 @ 11:59PM

Author

Yang An (UID: 106332601)

Published

March 14, 2024

1 ISL Exercise 9.7.1 (10pts)

This problem involves hyperplanes in two dimensions.

  a. Sketch the hyperplane 1 + 3X1 − X2 = 0. Indicate the set of points for which 1 + 3X1 − X2 > 0, as well as the set of points for which 1 + 3X1 − X2 < 0.
  b. On the same plot, sketch the hyperplane −2 + X1 + 2X2 = 0. Indicate the set of points for which −2 + X1 + 2X2 > 0, as well as the set of points for which −2 + X1 + 2X2 < 0.

Answer

(a) Plotting the line: 1 + 3X1 − X2 = 0 ⟺ X2 = 1 + 3X1. Rearranging for X2 makes it clear which side of the hyperplane is which:

1 + 3X1 − X2 < 0 ⟺ X2 > 1 + 3X1
1 + 3X1 − X2 > 0 ⟺ X2 < 1 + 3X1
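As a quick sanity check (not part of the required sketch), evaluating 1 + 3X1 − X2 at a few test points confirms which side of the line each region lies on:

```r
# Sign of f(X1, X2) = 1 + 3*X1 - X2 tells us which side of the line we are on
f <- function(x1, x2) 1 + 3 * x1 - x2

f(0, 5)   # above the line: 1 + 0 - 5 = -4 < 0
f(0, -5)  # below the line: 1 + 0 + 5 = 6 > 0
f(0, 1)   # on the line:    1 + 0 - 1 = 0
```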
library(ggplot2)
library(dplyr)
library(latex2exp)

X1 <- -10:10
X2 <- 3*X1 + 1

df1 <- data.frame(X1, X2, line = "a")

grid <- expand.grid(X1 = seq(min(df1$X1), max(df1$X1), length.out = 150), 
                    X2 = seq(min(df1$X2), max(df1$X2), length.out = 150)) %>%
  mutate(region = case_when(1 + 3*X1 - X2 < 0 ~ "lt 0", 
                            1 + 3*X1 - X2 > 0 ~ "gt 0"))

ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = region), alpha = 0.5) + 
  geom_line(data = df1, aes(x = X1, y = X2), linewidth = 1, col = "purple") + 
  annotate("text", x = -5, y = 10, 
           label = TeX("$1 + 3X_1 - X_2 < 0$", output = "character"),
           hjust = 0.5, size = 4, parse = TRUE, col = "black") + 
  annotate("text", x = 5, y = -10, 
           label = TeX("$1 + 3X_1 - X_2 > 0$", output = "character"),
           hjust = 0.5, size = 4, parse = TRUE, col = "black") + 
  scale_fill_manual(values = c("blue", "green")) +
  scale_x_continuous(expand = c(0.01,0.01), breaks = seq(-10, 10, 2)) +
  scale_y_continuous(expand = c(0.01,0.01), breaks = seq(-30, 30, 10)) +
  theme(legend.position = "none") + 
  labs(title = TeX(r'(Hyperplane Plot: $1 + 3X_1 - X_2 = 0$)'), 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

(b) Plotting the line: −2 + X1 + 2X2 = 0 ⟺ X2 = 1 − X1/2. Rearranging for X2 makes it clear which side of the hyperplane is which:

−2 + X1 + 2X2 < 0 ⟺ X2 < 1 − X1/2
−2 + X1 + 2X2 > 0 ⟺ X2 > 1 − X1/2

library(ggplot2)
library(dplyr)
library(latex2exp)
X1 <- -10:10
X2 <- 1 - 0.5*X1

df2 <- data.frame(X1, X2, line = "b")

grid <- grid %>%
  mutate(region = case_when(-2 + X1 + 2*X2 < 0 ~ "lt 0", 
                            -2 + X1 + 2*X2 > 0 ~ "gt 0"))

ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = region), alpha = 0.5) + 
  geom_line(data = df2, aes(x = X1, y = X2), linewidth = 1, col = "purple") + 
  annotate("text", x = 2.5, y = 15, 
           label = TeX("$-2 + X_1 + 2X_2 > 0$", output = "character"),
           hjust = 0.5, size = 4, parse = TRUE, col = "black") + 
  annotate("text", x = -2.5, y = -15, 
           label = TeX("$-2 + X_1 + 2X_2 < 0$", output = "character"),
           hjust = 0.5, size = 4, parse = TRUE, col = "black") + 
  scale_fill_manual(values = c("blue", "green")) +
  scale_x_continuous(expand = c(0.01,0.01), breaks = seq(-10, 10, 2)) +
  scale_y_continuous(expand = c(0.01,0.01), breaks = seq(-30, 30, 10)) +
  theme(legend.position = "none") + 
  labs(title = TeX(r'(Hyperplane Plot: $-2 + X_1 + 2X_2 = 0$)'), 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

library(ggplot2)
library(dplyr)
library(latex2exp)
bind_rows(df1, df2) %>%
  ggplot(aes(x = X1, y = X2, col = line)) + 
  geom_line(linewidth = 1) + 
  scale_color_manual(values = c("green", "blue"), 
                     labels = unname(TeX(c("$1 + 3X_1 - X_2 = 0$", "$-2 + X_1 + 2X_2 = 0$")))) + 
  scale_x_continuous(breaks = seq(-10, 10, 2)) +
  scale_y_continuous(breaks = seq(-30, 30, 10)) +
  labs(x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'), 
       col = "Hyperplane:") + 
  theme(legend.position = "bottom")

2 ISL Exercise 9.7.2 (10pts)

We have seen that in p = 2 dimensions, a linear decision boundary takes the form β0 + β1X1 + β2X2 = 0. We now investigate a non-linear decision boundary.

  a. Sketch the curve (1 + X1)^2 + (2 − X2)^2 = 4.
  b. On your sketch, indicate the set of points for which (1 + X1)^2 + (2 − X2)^2 > 4, as well as the set of points for which (1 + X1)^2 + (2 − X2)^2 ≤ 4.
  c. Suppose that a classifier assigns an observation to the blue class if (1 + X1)^2 + (2 − X2)^2 > 4, and to the red class otherwise. To what class is the observation (0, 0) classified? (−1, 1)? (2, 2)? (3, 8)?
  d. Argue that while the decision boundary in (c) is not linear in terms of X1 and X2, it is linear in terms of X1, X1^2, X2, and X2^2.

Answer

(a) The curve is a circle with center (−1, 2) and radius 2:

(1 + X1)^2 + (2 − X2)^2 = 4
(X1 − (−1))^2 + (X2 − 2)^2 = 2^2

library(ggplot2)

# Define the circle's data
create_circle <- function(x_center, y_center, radius, n_points = 100) {
  t <- seq(0, 2 * pi, length.out = n_points)
  x <- x_center + radius * cos(t)
  y <- y_center + radius * sin(t)
  data.frame(x, y)
}

# Create ggplot object
ggplot() + 
   geom_point(data = create_circle(-1, 2, 2), aes(x, y), 
             color = "mediumseagreen", size = 1) + 
  scale_x_continuous(expand = c(0.01, 0.01), limits = c(-3.5, 1.5)) +
  scale_y_continuous(expand = c(0.01, 0.01), limits = c(-0.5, 4.5)) +
  labs(
    title = TeX(r'(Curve Plot: $(1 + X_1)^2 + (2 - X_2)^2 = 4$)'),
    x = TeX(r'($X_1$)'),
    y = TeX(r'($X_2$)')
  ) +
  coord_fixed()
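As a small check (base R only), the points generated by create_circle should satisfy the circle equation up to floating-point error; the function is repeated here so the chunk is self-contained:

```r
# Parametrize the circle and verify every generated point satisfies
# (1 + x)^2 + (2 - y)^2 = 4 up to floating-point error
create_circle <- function(x_center, y_center, radius, n_points = 100) {
  t <- seq(0, 2 * pi, length.out = n_points)
  data.frame(x = x_center + radius * cos(t), y = y_center + radius * sin(t))
}
pts <- create_circle(-1, 2, 2)
max(abs((1 + pts$x)^2 + (2 - pts$y)^2 - 4))  # effectively zero
```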

(b) The set of points for which (1 + X1)^2 + (2 − X2)^2 > 4 is the set of points outside the circle, and the set of points for which (1 + X1)^2 + (2 − X2)^2 ≤ 4 is the set of points on or inside the circle.
library(ggplot2)
library(dplyr)

# Define custom function to create a circle
create_circle <- function(x_center, y_center, radius, n_points = 100) {
  t <- seq(0, 2 * pi, length.out = n_points)
  x <- x_center + radius * cos(t)
  y <- y_center + radius * sin(t)
  data.frame(x, y)
}

# Generate grid data
grid <- expand.grid(X1 = seq(-3.5, 1.5, length.out = 200), 
                    X2 = seq(-0.5, 4.5, length.out = 200)) %>%
  mutate(region = ifelse((1 + X1)^2 + (2 - X2)^2 > 4, "gt 4", "le 4"))

# Create ggplot object
ggplot() + 
  geom_tile(data = grid, aes(x = X1, y = X2, fill = region), alpha = 0.5) + 
  geom_point(data = create_circle(-1, 2, 2), aes(x, y), 
             color = "mediumseagreen", size = 1) + 
  annotate("text", x = -1, y = 2, 
           label = TeX("$(1 + X_1)^2 + (2 - X_2)^2 \\leq 4$"), 
           hjust = 0.5, size = 4, parse = TRUE, col = "red") + 
  annotate("text", x = 0.5, y = 0, 
           label = TeX("$(1 + X_1)^2 + (2 - X_2)^2 > 4$"), 
           hjust = 0.5 , size = 2, parse = TRUE, col = "blue") + 
  scale_x_continuous(expand = c(0.01,0.01), limits = c(-3.5, 1.5)) + 
  scale_y_continuous(expand = c(0.01,0.01), limits = c(-0.5, 4.5)) + 
  scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
  theme(legend.position = "none") + 
  labs(title = TeX(r'(Curve Plot: $(1 + X_1)^2 + (2 - X_2)^2 = 4$)'), 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)')) + 
  coord_fixed()
Warning: Removed 796 rows containing missing values (`geom_tile()`).
Warning in is.na(x): is.na() applied to non-(list or vector) of type
'expression'


(c) Evaluating (1 + X1)^2 + (2 − X2)^2 at each observation: (0, 0) gives 1 + 4 = 5 > 4, so blue; (−1, 1) gives 0 + 1 = 1 ≤ 4, so red; (2, 2) gives 9 + 0 = 9 > 4, so blue; and (3, 8) gives 16 + 36 = 52 > 4, so blue.
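These assignments can be verified directly by computing f(X1, X2) = (1 + X1)^2 + (2 − X2)^2 for each observation (a base-R check, independent of the plot):

```r
# f > 4 -> blue class, otherwise red class
f <- function(x1, x2) (1 + x1)^2 + (2 - x2)^2
obs <- data.frame(X1 = c(0, -1, 2, 3), X2 = c(0, 1, 2, 8))
obs$f <- f(obs$X1, obs$X2)                   # 5, 1, 9, 52
obs$class <- ifelse(obs$f > 4, "blue", "red")
obs
```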
library(ggplot2)
library(dplyr)

# Define custom function to create a circle
create_circle <- function(x_center, y_center, radius, n_points = 100) {
  t <- seq(0, 2 * pi, length.out = n_points)
  x <- x_center + radius * cos(t)
  y <- y_center + radius * sin(t)
  data.frame(x, y)
}

# Define data for new observations
new_obs <- data.frame(X1 = c(0, -1, 2, 3), X2 = c(0, 1, 2, 8)) %>%
  mutate(region = ifelse((1 + X1)^2 + (2 - X2)^2 > 4, "gt 4", "le 4"))

# Generate grid data
grid <- expand.grid(X1 = seq(-3.5, 3.5, length.out = 200), 
                    X2 = seq(-0.5, 8.5, length.out = 200)) %>%
  mutate(region = ifelse((1 + X1)^2 + (2 - X2)^2 > 4, "gt 4", "le 4"))

# Create ggplot object
ggplot() + 
  geom_tile(data = grid, aes(x = X1, y = X2, fill = region), alpha = 0.5) + 
  geom_path(data = create_circle(-1, 2, 2), aes(x, y),
            col = "mediumseagreen", size = 1) + 
  geom_point(data = new_obs, aes(x = X1, y = X2, col = region)) + 
  scale_x_continuous(expand = c(0.01,0.01), limits = c(-3.5, 3.5)) + 
  scale_y_continuous(expand = c(0.01,0.01), limits = c(-0.5, 8.5)) + 
  scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
  scale_color_manual(values = c("blue", "red")) +
  theme(legend.position = "none") + 
  labs(title = TeX(r'(Classifier Plot: $f(X) = (1 + X_1)^2 + (2 - X_2)^2 - 4$)'
                   ), 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)')) + 
  coord_fixed()
Warning: Removed 796 rows containing missing values (`geom_tile()`).

(d) The decision boundary is the set where f(X) changes sign, i.e. where f(X) = 0, because that is where the prediction changes. Expanding: \[\begin{align*} f(X) & = (1 + X_1)^2 + (2 - X_2)^2 - 4 \\ & = X_1^2 + 2X_1 + 1 + X_2^2 - 4X_2 + 4 - 4 \\ & = X_1^2 + X_2^2 + 2X_1 - 4X_2 + 1 \\ & = (1)(X_1^2) + (1)(X_2^2) + (2)(X_1) + (-4)(X_2) + 1 \end{align*}\] From the final form, f(X) = 0 can be written as a_1 Z_1 + a_2 Z_2 + a_3 Z_3 + a_4 Z_4 + b = 0 with (Z_1, Z_2, Z_3, Z_4) = (X_1^2, X_2^2, X_1, X_2), so the boundary is linear in terms of X_1, X_1^2, X_2, and X_2^2. In the original (X_1, X_2) feature space, however, the decision boundary is a circle.
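A quick numeric spot-check (base R only) that the expanded quadratic agrees with the original f(X) at arbitrary points:

```r
# The expanded form X1^2 + X2^2 + 2*X1 - 4*X2 + 1 should match
# (1 + X1)^2 + (2 - X2)^2 - 4 everywhere
set.seed(1)
x1 <- runif(5, -5, 5)
x2 <- runif(5, -5, 5)
orig <- (1 + x1)^2 + (2 - x2)^2 - 4
expanded <- x1^2 + x2^2 + 2 * x1 - 4 * x2 + 1
all.equal(orig, expanded)  # TRUE
```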

3 Support vector machines (SVMs) on the Carseats data set (30pts)

Follow the machine learning workflow to train support vector classifier (same as SVM with linear kernel), SVM with polynomial kernel (tune the degree and regularization parameter \(C\)), and SVM with radial kernel (tune the scale parameter \(\gamma\) and regularization parameter \(C\)) for classifying Sales<=8 versus Sales>8. Use the same seed as in your HW4 for the initial test/train split and compare the final test AUC and accuracy to those methods you tried in HW4.
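As a sketch of the radial-kernel piece of this workflow (parsnip calls the scale parameter γ `rbf_sigma`; the grid ranges below are illustrative assumptions, not values prescribed by the assignment):

```r
library(tidymodels)

# Sketch: radial-basis SVM tuning both the cost C and the kernel
# scale parameter gamma (called rbf_sigma in parsnip)
svm_rbf_mod <-
  svm_rbf(
    mode = "classification",
    cost = tune(),
    rbf_sigma = tune()
  ) %>%
  set_engine("kernlab")

# Illustrative tuning grid; ranges are on the log10 scale used by dials
param_grid_rbf <- grid_regular(
  cost(range = c(-3, 2)),
  rbf_sigma(range = c(-3, 0)),
  levels = 5
)
```

This specification would then be combined with the same recipe, folds, and `tune_grid()` call used for the other kernels below.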

3.1 Machine Learning Workflow: SVM with Linear Kernel (Carseats Data)

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] latex2exp_0.9.6 dplyr_1.1.4     ggplot2_3.4.4  

loaded via a namespace (and not attached):
 [1] vctrs_0.6.4       cli_3.6.1         knitr_1.45        rlang_1.1.1      
 [5] xfun_0.41         stringi_1.7.12    generics_0.1.3    jsonlite_1.8.7   
 [9] labeling_0.4.3    glue_1.6.2        colorspace_2.1-0  htmltools_0.5.7  
[13] scales_1.3.0      fansi_1.0.4       rmarkdown_2.25    grid_4.3.1       
[17] evaluate_0.23     munsell_0.5.0     tibble_3.2.1      fastmap_1.1.1    
[21] yaml_2.3.7        lifecycle_1.0.3   stringr_1.5.0     compiler_4.3.1   
[25] htmlwidgets_1.6.4 pkgconfig_2.0.3   rstudioapi_0.15.0 farver_2.1.1     
[29] digest_0.6.33     R6_2.5.1          tidyselect_1.2.0  utf8_1.2.3       
[33] pillar_1.9.0      magrittr_2.0.3    withr_2.5.1       tools_4.3.1      
[37] gtable_0.3.4     
# Load necessary libraries
library(GGally)
library(gtsummary)
library(ranger)
library(tidyverse)
library(tidymodels)
library(ISLR2)

# Load the dataset
Sales <- ISLR2::Carseats %>%
  mutate(Sales = ifelse(Sales <= 8, "Low", "High")) 
  # Check the structure of the dataset
glimpse(Sales)
Rows: 400
Columns: 11
$ Sales       <chr> "High", "High", "High", "Low", "Low", "High", "Low", "High…
$ CompPrice   <dbl> 138, 111, 113, 117, 141, 124, 115, 136, 132, 132, 121, 117…
$ Income      <dbl> 73, 48, 35, 100, 64, 113, 105, 81, 110, 113, 78, 94, 35, 2…
$ Advertising <dbl> 11, 16, 10, 4, 3, 13, 0, 15, 0, 0, 9, 4, 2, 11, 11, 5, 0, …
$ Population  <dbl> 276, 260, 269, 466, 340, 501, 45, 425, 108, 131, 150, 503,…
$ Price       <dbl> 120, 83, 80, 97, 128, 72, 108, 120, 124, 124, 100, 94, 136…
$ ShelveLoc   <fct> Bad, Good, Medium, Medium, Bad, Bad, Medium, Good, Medium,…
$ Age         <dbl> 42, 65, 59, 55, 38, 78, 71, 67, 76, 76, 26, 50, 62, 53, 52…
$ Education   <dbl> 17, 10, 12, 14, 13, 16, 15, 10, 10, 17, 10, 13, 18, 18, 18…
$ Urban       <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes, Ye…
$ US          <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Yes, Yes, N…
# Numerical summaries stratified by the outcome `Sales`.
Sales %>% tbl_summary(by = Sales)
Characteristic High, N = 164¹ Low, N = 236¹
CompPrice 127 (115, 136) 124 (115, 134)
Income 76 (58, 94) 65 (40, 88)
Advertising 10.0 (0.0, 14.0) 3.0 (0.0, 10.0)
Population 276 (150, 403) 268 (132, 389)
Price 107 (90, 123) 122 (107, 135)
ShelveLoc

    Bad 14 (8.5%) 82 (35%)
    Good 66 (40%) 19 (8.1%)
    Medium 84 (51%) 135 (57%)
Age 49 (36, 61) 58 (42, 70)
Education

    10 25 (15%) 23 (9.7%)
    11 17 (10%) 31 (13%)
    12 19 (12%) 30 (13%)
    13 17 (10%) 26 (11%)
    14 14 (8.5%) 26 (11%)
    15 18 (11%) 18 (7.6%)
    16 18 (11%) 29 (12%)
    17 21 (13%) 28 (12%)
    18 15 (9.1%) 25 (11%)
Urban 110 (67%) 172 (73%)
US 123 (75%) 135 (57%)
¹ Median (IQR); n (%)
# Graphical summary:
 Sales %>% ggpairs()

# For reproducibility
set.seed(203)

data_split <- initial_split(
  Sales, 
  # stratify by Sales
  strata = "Sales", 
  prop = 0.75
  )
data_split
<Training/Testing/Total>
<300/100/400>
Sales_other <- training(data_split)
dim(Sales_other)
[1] 300  11
Sales_test <- testing(data_split)
dim(Sales_test)
[1] 100  11
svm_recipe <- 
  recipe(
    Sales ~ ., 
    data = Sales_other
  ) %>%
  # create traditional dummy variables (necessary for svm)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>% 
  # center and scale numeric data
  step_normalize(all_numeric_predictors())
svm_recipe
svm_mod3 <- 
  svm_poly(
    mode = "classification",
    cost = tune(),
    # scale_factor = tune()
  ) %>% 
  set_engine("kernlab")
svm_mod3
Polynomial Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = tune()

Computational engine: kernlab 
svm_wf3 <- workflow() %>%
  add_recipe(svm_recipe) %>%
  add_model(svm_mod3)
svm_wf3
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: svm_poly()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Polynomial Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = tune()

Computational engine: kernlab 
param_grid <- grid_regular(
  cost(range = c(-3, 2)),
  levels = c(5)
)
param_grid
# A tibble: 5 × 1
   cost
  <dbl>
1 0.125
2 0.297
3 0.707
4 1.68 
5 4    
set.seed(203)

folds <- vfold_cv(Sales_other, v = 5)
folds
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [240/60]> Fold1
2 <split [240/60]> Fold2
3 <split [240/60]> Fold3
4 <split [240/60]> Fold4
5 <split [240/60]> Fold5
svm_fit3 <- svm_wf3 %>%
  tune_grid(
    resamples = folds,
    grid = param_grid,
    metrics = metric_set(roc_auc, accuracy)
    )
svm_fit3
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics          .notes          
  <list>           <chr> <list>            <list>          
1 <split [240/60]> Fold1 <tibble [10 × 5]> <tibble [0 × 3]>
2 <split [240/60]> Fold2 <tibble [10 × 5]> <tibble [0 × 3]>
3 <split [240/60]> Fold3 <tibble [10 × 5]> <tibble [0 × 3]>
4 <split [240/60]> Fold4 <tibble [10 × 5]> <tibble [0 × 3]>
5 <split [240/60]> Fold5 <tibble [10 × 5]> <tibble [0 × 3]>
svm_fit3 %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "roc_auc" ) %>%
  ggplot(mapping = aes(x = cost, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Cost", y = "CV AUC") +
  scale_x_log10()
# A tibble: 10 × 7
    cost .metric  .estimator  mean     n std_err .config             
   <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
 1 0.125 accuracy binary     0.873     5 0.0172  Preprocessor1_Model1
 2 0.125 roc_auc  binary     0.952     5 0.00836 Preprocessor1_Model1
 3 0.297 accuracy binary     0.853     5 0.0122  Preprocessor1_Model2
 4 0.297 roc_auc  binary     0.948     5 0.00754 Preprocessor1_Model2
 5 0.707 accuracy binary     0.857     5 0.0113  Preprocessor1_Model3
 6 0.707 roc_auc  binary     0.943     5 0.00678 Preprocessor1_Model3
 7 1.68  accuracy binary     0.86      5 0.0125  Preprocessor1_Model4
 8 1.68  roc_auc  binary     0.947     5 0.00891 Preprocessor1_Model4
 9 4     accuracy binary     0.863     5 0.00972 Preprocessor1_Model5
10 4     roc_auc  binary     0.947     5 0.00571 Preprocessor1_Model5

svm_fit3 %>%
  show_best("roc_auc")
# A tibble: 5 × 7
   cost .metric .estimator  mean     n std_err .config             
  <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 0.125 roc_auc binary     0.952     5 0.00836 Preprocessor1_Model1
2 0.297 roc_auc binary     0.948     5 0.00754 Preprocessor1_Model2
3 1.68  roc_auc binary     0.947     5 0.00891 Preprocessor1_Model4
4 4     roc_auc binary     0.947     5 0.00571 Preprocessor1_Model5
5 0.707 roc_auc binary     0.943     5 0.00678 Preprocessor1_Model3
best_svm3 <- svm_fit3 %>%
  select_best("roc_auc")
best_svm3
# A tibble: 1 × 2
   cost .config             
  <dbl> <chr>               
1 0.125 Preprocessor1_Model1
# Final workflow
final_wf3 <- svm_wf3 %>%
  finalize_workflow(best_svm3)
final_wf3
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: svm_poly()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Polynomial Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = 0.125

Computational engine: kernlab 
# Fit the whole training set, then predict the test cases
final_fit3 <- 
  final_wf3 %>%
  last_fit(data_split)
final_fit3
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [300/100]> train/test split <tibble> <tibble> <tibble>     <workflow>
# Test metrics
final_fit3 %>% 
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.87  Preprocessor1_Model1
2 roc_auc  binary         0.957 Preprocessor1_Model1
library(doParallel)

set.seed(101)
split_obj <- initial_split(data = Sales, prop = 0.7, strata = Sales)
train <- training(split_obj)
test <- testing(split_obj)


# Create the recipe
recipe(Sales ~ ., data = train) %>%
  # create traditional dummy variables (necessary for svm)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>% 
  # center and scale numeric data
  step_normalize(all_numeric_predictors()) %>%
  # estimate the means and standard deviations
  prep() -> recipe_obj

# Bake
train <- bake(recipe_obj, new_data=train)
test <- bake(recipe_obj, new_data=test)
library(vip)
final_fit3 %>% 
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip() %>% 
  # vip(method = "permute", train= Sales)
  vip(method = "permute", 
      target = "Sales", metric = "accuracy",
      pred_wrapper = kernlab::predict, train = train)

# Check for missing values in the training data
any_missing <- any(is.na(train))
any_missing
[1] FALSE
svm_linear_spec <- svm_linear() %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_linear_fit <- svm_linear_spec %>%
  fit(Sales ~ ., data = train[, c('Price', 'Age', 'Sales')])
 Setting default kernel parameters  
svm_linear_fit %>%
  extract_fit_engine() %>%
  plot()

3.2 Machine Learning Workflow: SVM with Polynomial Kernel (Carseats Data)

# Load necessary libraries
library(GGally)
library(gtsummary)
library(ranger)
library(tidyverse)
library(tidymodels)
library(ISLR2)

# Load the dataset
Sales <- ISLR2::Carseats %>%
  mutate(Sales = ifelse(Sales <= 8, "Low", "High")) 
  # Check the structure of the dataset
glimpse(Sales)
Rows: 400
Columns: 11
$ Sales       <chr> "High", "High", "High", "Low", "Low", "High", "Low", "High…
$ CompPrice   <dbl> 138, 111, 113, 117, 141, 124, 115, 136, 132, 132, 121, 117…
$ Income      <dbl> 73, 48, 35, 100, 64, 113, 105, 81, 110, 113, 78, 94, 35, 2…
$ Advertising <dbl> 11, 16, 10, 4, 3, 13, 0, 15, 0, 0, 9, 4, 2, 11, 11, 5, 0, …
$ Population  <dbl> 276, 260, 269, 466, 340, 501, 45, 425, 108, 131, 150, 503,…
$ Price       <dbl> 120, 83, 80, 97, 128, 72, 108, 120, 124, 124, 100, 94, 136…
$ ShelveLoc   <fct> Bad, Good, Medium, Medium, Bad, Bad, Medium, Good, Medium,…
$ Age         <dbl> 42, 65, 59, 55, 38, 78, 71, 67, 76, 76, 26, 50, 62, 53, 52…
$ Education   <dbl> 17, 10, 12, 14, 13, 16, 15, 10, 10, 17, 10, 13, 18, 18, 18…
$ Urban       <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes, Ye…
$ US          <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Yes, Yes, N…
# Numerical summaries stratified by the outcome `Sales`.
Sales %>% tbl_summary(by = Sales)
Characteristic High, N = 164¹ Low, N = 236¹
CompPrice 127 (115, 136) 124 (115, 134)
Income 76 (58, 94) 65 (40, 88)
Advertising 10.0 (0.0, 14.0) 3.0 (0.0, 10.0)
Population 276 (150, 403) 268 (132, 389)
Price 107 (90, 123) 122 (107, 135)
ShelveLoc

    Bad 14 (8.5%) 82 (35%)
    Good 66 (40%) 19 (8.1%)
    Medium 84 (51%) 135 (57%)
Age 49 (36, 61) 58 (42, 70)
Education

    10 25 (15%) 23 (9.7%)
    11 17 (10%) 31 (13%)
    12 19 (12%) 30 (13%)
    13 17 (10%) 26 (11%)
    14 14 (8.5%) 26 (11%)
    15 18 (11%) 18 (7.6%)
    16 18 (11%) 29 (12%)
    17 21 (13%) 28 (12%)
    18 15 (9.1%) 25 (11%)
Urban 110 (67%) 172 (73%)
US 123 (75%) 135 (57%)
¹ Median (IQR); n (%)
# Graphical summary:
 Sales %>% ggpairs()

# For reproducibility
set.seed(203)

data_split <- initial_split(
  Sales, 
  # stratify by Sales
  strata = "Sales", 
  prop = 0.75
  )
data_split
<Training/Testing/Total>
<300/100/400>
Sales_other <- training(data_split)
dim(Sales_other)
[1] 300  11
Sales_test <- testing(data_split)
dim(Sales_test)
[1] 100  11
svm_recipe <- 
  recipe(
    Sales ~ ., 
    data = Sales_other
  ) %>%
  # create traditional dummy variables (necessary for svm)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>% 
  # center and scale numeric data
  step_normalize(all_numeric_predictors())
svm_recipe
svm_mod <- 
  svm_poly(
    mode = "classification",
    cost = tune(),
    degree = tune(),
    # scale_factor = tune()
  ) %>% 
  set_engine("kernlab")
svm_mod
Polynomial Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = tune()
  degree = tune()

Computational engine: kernlab 
svm_wf <- workflow() %>%
  add_recipe(svm_recipe) %>%
  add_model(svm_mod)
svm_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: svm_poly()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Polynomial Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = tune()
  degree = tune()

Computational engine: kernlab 
param_grid <- grid_regular(
  cost(range = c(-3, 2)),
  degree(range = c(1, 5)),
  #scale_factor(range = c(-1, 1)),
  levels = c(5)
  )
param_grid
# A tibble: 25 × 2
    cost degree
   <dbl>  <dbl>
 1 0.125      1
 2 0.297      1
 3 0.707      1
 4 1.68       1
 5 4          1
 6 0.125      2
 7 0.297      2
 8 0.707      2
 9 1.68       2
10 4          2
# ℹ 15 more rows
set.seed(203)

folds <- vfold_cv(Sales_other, v = 5)
folds
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [240/60]> Fold1
2 <split [240/60]> Fold2
3 <split [240/60]> Fold3
4 <split [240/60]> Fold4
5 <split [240/60]> Fold5
svm_fit <- svm_wf %>%
  tune_grid(
    resamples = folds,
    grid = param_grid,
    metrics = metric_set(roc_auc, accuracy)
    )
svm_fit
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics          .notes          
  <list>           <chr> <list>            <list>          
1 <split [240/60]> Fold1 <tibble [50 × 6]> <tibble [0 × 3]>
2 <split [240/60]> Fold2 <tibble [50 × 6]> <tibble [0 × 3]>
3 <split [240/60]> Fold3 <tibble [50 × 6]> <tibble [0 × 3]>
4 <split [240/60]> Fold4 <tibble [50 × 6]> <tibble [0 × 3]>
5 <split [240/60]> Fold5 <tibble [50 × 6]> <tibble [0 × 3]>
svm_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "roc_auc" ) %>%
  ggplot(mapping = aes(x = degree, y = mean, group = cost, color = factor(cost))) +
  geom_point() +
  geom_line() +
  labs(x = "Degree", y = "CV AUC", color = "Cost")
# A tibble: 50 × 8
    cost degree .metric  .estimator  mean     n std_err .config              
   <dbl>  <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
 1 0.125      1 accuracy binary     0.873     5 0.0172  Preprocessor1_Model01
 2 0.125      1 roc_auc  binary     0.952     5 0.00836 Preprocessor1_Model01
 3 0.297      1 accuracy binary     0.853     5 0.0122  Preprocessor1_Model02
 4 0.297      1 roc_auc  binary     0.948     5 0.00754 Preprocessor1_Model02
 5 0.707      1 accuracy binary     0.857     5 0.0113  Preprocessor1_Model03
 6 0.707      1 roc_auc  binary     0.943     5 0.00678 Preprocessor1_Model03
 7 1.68       1 accuracy binary     0.86      5 0.0125  Preprocessor1_Model04
 8 1.68       1 roc_auc  binary     0.947     5 0.00891 Preprocessor1_Model04
 9 4          1 accuracy binary     0.863     5 0.00972 Preprocessor1_Model05
10 4          1 roc_auc  binary     0.947     5 0.00571 Preprocessor1_Model05
# ℹ 40 more rows

svm_fit %>%
  show_best("roc_auc")
# A tibble: 5 × 8
   cost degree .metric .estimator  mean     n std_err .config              
  <dbl>  <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1 0.125      1 roc_auc binary     0.952     5 0.00836 Preprocessor1_Model01
2 0.297      1 roc_auc binary     0.948     5 0.00754 Preprocessor1_Model02
3 1.68       1 roc_auc binary     0.947     5 0.00891 Preprocessor1_Model04
4 4          1 roc_auc binary     0.947     5 0.00571 Preprocessor1_Model05
5 0.707      1 roc_auc binary     0.943     5 0.00678 Preprocessor1_Model03
best_svm <- svm_fit %>%
  select_best("roc_auc")
best_svm
# A tibble: 1 × 3
   cost degree .config              
  <dbl>  <dbl> <chr>                
1 0.125      1 Preprocessor1_Model01
# Final workflow
final_wf <- svm_wf %>%
  finalize_workflow(best_svm)
final_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: svm_poly()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Polynomial Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = 0.125
  degree = 1

Computational engine: kernlab 
# Fit the whole training set, then predict the test cases
final_fit <- 
  final_wf %>%
  last_fit(data_split)
final_fit
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [300/100]> train/test split <tibble> <tibble> <tibble>     <workflow>
# Test metrics
final_fit %>% 
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.87  Preprocessor1_Model1
2 roc_auc  binary         0.957 Preprocessor1_Model1
install.packages("doParallel", repos = "https://cloud.r-project.org/")

library(doParallel)

set.seed(101)
split_obj <- initial_split(data = Sales, prop = 0.7, strata = Sales)
train <- training(split_obj)
test <- testing(split_obj)


# Create the recipe
recipe(Sales ~ ., data = train) %>%
  # create traditional dummy variables (necessary for svm)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>% 
  # center and scale numeric data
  step_normalize(all_numeric_predictors()) %>%
  # estimate the means and standard deviations
  prep() -> recipe_obj

# Bake
train <- bake(recipe_obj, new_data=train)
test <- bake(recipe_obj, new_data=test)
library(vip)
final_fit %>% 
  pluck(".workflow", 1) %>%   
  # extract the fitted parsnip model (extract_fit_parsnip() supersedes pull_workflow_fit())
  extract_fit_parsnip() %>% 
  vip(method = "permute", 
      target = "Sales", metric = "accuracy",
      pred_wrapper = kernlab::predict, train = train)

# Check for missing values in the training data
any_missing <- any(is.na(train))
any_missing
[1] FALSE
svm_rbf_spec <- svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_rbf_fit <- svm_rbf_spec %>%
  fit(Sales ~ ., data = train[, c('Price', 'Age', 'Sales')])

svm_rbf_fit %>%
  extract_fit_engine() %>%
  plot()

3.3 Machine Learning Workflow: SVM with RBF Kernel (Carseats Data)

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] vip_0.4.1          doParallel_1.0.17  iterators_1.0.14   foreach_1.5.2     
 [5] kernlab_0.9-32     ISLR2_1.3-2        yardstick_1.3.0    workflowsets_1.0.1
 [9] workflows_1.1.4    tune_1.1.2         rsample_1.2.0      recipes_1.0.10    
[13] parsnip_1.2.0      modeldata_1.3.0    infer_1.0.6        dials_1.2.1       
[17] scales_1.3.0       broom_1.0.5        tidymodels_1.1.1   lubridate_1.9.3   
[21] forcats_1.0.0      stringr_1.5.0      purrr_1.0.2        readr_2.1.4       
[25] tidyr_1.3.0        tibble_3.2.1       tidyverse_2.0.0    ranger_0.16.0     
[29] gtsummary_1.7.2    GGally_2.2.0       latex2exp_0.9.6    dplyr_1.1.4       
[33] ggplot2_3.4.4     

loaded via a namespace (and not attached):
 [1] rlang_1.1.1          magrittr_2.0.3       furrr_0.3.1         
 [4] compiler_4.3.1       vctrs_0.6.4          lhs_1.1.6           
 [7] pkgconfig_2.0.3      fastmap_1.1.1        backports_1.4.1     
[10] ellipsis_0.3.2       labeling_0.4.3       utf8_1.2.3          
[13] rmarkdown_2.25       prodlim_2023.08.28   markdown_1.12       
[16] tzdb_0.4.0           xfun_0.41            jsonlite_1.8.7      
[19] R6_2.5.1             stringi_1.7.12       RColorBrewer_1.1-3  
[22] parallelly_1.36.0    rpart_4.1.19         Rcpp_1.0.11         
[25] knitr_1.45           future.apply_1.11.1  Matrix_1.5-4.1      
[28] splines_4.3.1        nnet_7.3-19          timechange_0.2.0    
[31] tidyselect_1.2.0     rstudioapi_0.15.0    yaml_2.3.7          
[34] timeDate_4032.109    codetools_0.2-19     listenv_0.9.0       
[37] lattice_0.21-8       plyr_1.8.9           withr_2.5.1         
[40] evaluate_0.23        future_1.33.1        survival_3.5-5      
[43] ggstats_0.5.1        xml2_1.3.6           pillar_1.9.0        
[46] generics_0.1.3       hms_1.1.3            munsell_0.5.0       
[49] commonmark_1.9.0     globals_0.16.2       class_7.3-22        
[52] glue_1.6.2           tools_4.3.1          data.table_1.14.10  
[55] gower_1.0.1          grid_4.3.1           ipred_0.9-14        
[58] colorspace_2.1-0     cli_3.6.1            DiceDesign_1.10     
[61] fansi_1.0.4          broom.helpers_1.14.0 lava_1.7.3          
[64] gt_0.10.0            gtable_0.3.4         GPfit_1.0-8         
[67] sass_0.4.7           digest_0.6.33        htmlwidgets_1.6.4   
[70] farver_2.1.1         htmltools_0.5.7      lifecycle_1.0.3     
[73] hardhat_1.3.1        MASS_7.3-60         
# Load libraries
library(GGally)
library(gtsummary)
library(kernlab)
library(tidyverse)
library(tidymodels)
library(ISLR2)

# Load the dataset
Carsets <- ISLR2::Carseats %>%
  mutate(Sales = ifelse(Sales <= 8, "Low", "High")) 
  # Check the structure of the dataset
glimpse(Carsets)
Rows: 400
Columns: 11
$ Sales       <chr> "High", "High", "High", "Low", "Low", "High", "Low", "High…
$ CompPrice   <dbl> 138, 111, 113, 117, 141, 124, 115, 136, 132, 132, 121, 117…
$ Income      <dbl> 73, 48, 35, 100, 64, 113, 105, 81, 110, 113, 78, 94, 35, 2…
$ Advertising <dbl> 11, 16, 10, 4, 3, 13, 0, 15, 0, 0, 9, 4, 2, 11, 11, 5, 0, …
$ Population  <dbl> 276, 260, 269, 466, 340, 501, 45, 425, 108, 131, 150, 503,…
$ Price       <dbl> 120, 83, 80, 97, 128, 72, 108, 120, 124, 124, 100, 94, 136…
$ ShelveLoc   <fct> Bad, Good, Medium, Medium, Bad, Bad, Medium, Good, Medium,…
$ Age         <dbl> 42, 65, 59, 55, 38, 78, 71, 67, 76, 76, 26, 50, 62, 53, 52…
$ Education   <dbl> 17, 10, 12, 14, 13, 16, 15, 10, 10, 17, 10, 13, 18, 18, 18…
$ Urban       <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes, Ye…
$ US          <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Yes, Yes, N…
# Numerical summaries stratified by the outcome `Sales`.
Carsets %>% tbl_summary(by = Sales)
Characteristic    High, N = 164¹     Low, N = 236¹
CompPrice         127 (115, 136)     124 (115, 134)
Income            76 (58, 94)        65 (40, 88)
Advertising       10.0 (0.0, 14.0)   3.0 (0.0, 10.0)
Population        276 (150, 403)     268 (132, 389)
Price             107 (90, 123)      122 (107, 135)
ShelveLoc
    Bad           14 (8.5%)          82 (35%)
    Good          66 (40%)           19 (8.1%)
    Medium        84 (51%)           135 (57%)
Age               49 (36, 61)        58 (42, 70)
Education
    10            25 (15%)           23 (9.7%)
    11            17 (10%)           31 (13%)
    12            19 (12%)           30 (13%)
    13            17 (10%)           26 (11%)
    14            14 (8.5%)          26 (11%)
    15            18 (11%)           18 (7.6%)
    16            18 (11%)           29 (12%)
    17            21 (13%)           28 (12%)
    18            15 (9.1%)          25 (11%)
Urban             110 (67%)          172 (73%)
US                123 (75%)          135 (57%)
¹ Median (IQR); n (%)
# Graphical summary:
Carsets %>% ggpairs()

# For reproducibility
set.seed(203)

data_split2 <- initial_split(
  Carsets, 
  # stratify by Sales
  strata = "Sales", 
  prop = 0.75
  )
data_split2
<Training/Testing/Total>
<300/100/400>
Carsets_other <- training(data_split2)
dim(Carsets_other)
[1] 300  11
Carsets_test <- testing(data_split2)
dim(Carsets_test)
[1] 100  11
svm_recipe2 <- 
  recipe(
    Sales ~ ., 
    data = Carsets_other
  ) %>%
  # create traditional dummy variables (necessary for svm)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>% 
  # center and scale numeric data
  step_normalize(all_numeric_predictors())
svm_recipe2
svm_mod2 <- 
  svm_rbf(
    mode = "classification",
    cost = tune(),
    rbf_sigma = tune()
  ) %>% 
  set_engine("kernlab")
svm_mod2
Radial Basis Function Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = tune()
  rbf_sigma = tune()

Computational engine: kernlab 
svm_wf2 <- workflow() %>%
  add_recipe(svm_recipe2) %>%
  add_model(svm_mod2)
svm_wf2
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: svm_rbf()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Radial Basis Function Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = tune()
  rbf_sigma = tune()

Computational engine: kernlab 
param_grid2 <- grid_regular(
  cost(range = c(-8, 5)),
  rbf_sigma(range = c(-5, -3)),
  levels = c(14, 5)
  )
param_grid2
# A tibble: 70 × 2
      cost rbf_sigma
     <dbl>     <dbl>
 1 0.00391   0.00001
 2 0.00781   0.00001
 3 0.0156    0.00001
 4 0.0312    0.00001
 5 0.0625    0.00001
 6 0.125     0.00001
 7 0.25      0.00001
 8 0.5       0.00001
 9 1         0.00001
10 2         0.00001
# ℹ 60 more rows
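As an aside on the grid above: the `range` arguments to `cost()` and `rbf_sigma()` are specified on transformed scales (log2 for `cost()`, log10 for `rbf_sigma()` in the dials package), which is why the grid runs from 2^-8 ≈ 0.00391 up to 2^5 = 32 in cost and from 10^-5 up to 10^-3 in rbf_sigma. A minimal sketch inspecting the parameter objects directly:

```r
library(dials)

# cost() is parameterized on a log2 scale, rbf_sigma() on a log10 scale,
# so range = c(-8, 5) means 2^-8 ... 2^5 and range = c(-5, -3) means 10^-5 ... 10^-3
cost(range = c(-8, 5))
rbf_sigma(range = c(-5, -3))

# Endpoints on the original (untransformed) scale:
2^c(-8, 5)    # 0.00390625 and 32, matching the tuning grid above
10^c(-5, -3)  # 1e-05 and 1e-03
```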
set.seed(203)

folds2 <- vfold_cv(Carsets_other, v = 5)
folds2
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [240/60]> Fold1
2 <split [240/60]> Fold2
3 <split [240/60]> Fold3
4 <split [240/60]> Fold4
5 <split [240/60]> Fold5
svm_fit2 <- svm_wf2 %>%
  tune_grid(
    resamples = folds2,
    grid = param_grid2,
    metrics = metric_set(roc_auc, accuracy)
    )
svm_fit2
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics           .notes          
  <list>           <chr> <list>             <list>          
1 <split [240/60]> Fold1 <tibble [140 × 6]> <tibble [0 × 3]>
2 <split [240/60]> Fold2 <tibble [140 × 6]> <tibble [0 × 3]>
3 <split [240/60]> Fold3 <tibble [140 × 6]> <tibble [0 × 3]>
4 <split [240/60]> Fold4 <tibble [140 × 6]> <tibble [0 × 3]>
5 <split [240/60]> Fold5 <tibble [140 × 6]> <tibble [0 × 3]>
svm_fit2 %>%
  collect_metrics() %>%
  # print the full metrics tibble before plotting (print() returns its input invisibly)
  print(width = Inf) %>%
  filter(.metric == "roc_auc") %>%
  ggplot(mapping = aes(x = cost, y = mean, alpha = rbf_sigma)) +
  geom_point() +
  geom_line(aes(group = rbf_sigma)) +
  labs(x = "Cost", y = "CV AUC") +
  scale_x_log10()
# A tibble: 140 × 8
      cost rbf_sigma .metric  .estimator  mean     n std_err
     <dbl>     <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
 1 0.00391   0.00001 accuracy binary     0.59      5  0.0233
 2 0.00391   0.00001 roc_auc  binary     0.879     5  0.0158
 3 0.00781   0.00001 accuracy binary     0.59      5  0.0233
 4 0.00781   0.00001 roc_auc  binary     0.879     5  0.0158
 5 0.0156    0.00001 accuracy binary     0.59      5  0.0233
 6 0.0156    0.00001 roc_auc  binary     0.879     5  0.0158
 7 0.0312    0.00001 accuracy binary     0.59      5  0.0233
 8 0.0312    0.00001 roc_auc  binary     0.879     5  0.0158
 9 0.0625    0.00001 accuracy binary     0.59      5  0.0233
10 0.0625    0.00001 roc_auc  binary     0.879     5  0.0158
   .config              
   <chr>                
 1 Preprocessor1_Model01
 2 Preprocessor1_Model01
 3 Preprocessor1_Model02
 4 Preprocessor1_Model02
 5 Preprocessor1_Model03
 6 Preprocessor1_Model03
 7 Preprocessor1_Model04
 8 Preprocessor1_Model04
 9 Preprocessor1_Model05
10 Preprocessor1_Model05
# ℹ 130 more rows

svm_fit2 %>%
  show_best("roc_auc")
# A tibble: 5 × 8
   cost rbf_sigma .metric .estimator  mean     n std_err .config              
  <dbl>     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1    32  0.001    roc_auc binary     0.951     5 0.00997 Preprocessor1_Model70
2    16  0.001    roc_auc binary     0.943     5 0.0113  Preprocessor1_Model69
3    32  0.000316 roc_auc binary     0.934     5 0.0122  Preprocessor1_Model56
4     8  0.001    roc_auc binary     0.927     5 0.0121  Preprocessor1_Model68
5    16  0.000316 roc_auc binary     0.910     5 0.0136  Preprocessor1_Model55
best_svm2 <- svm_fit2 %>%
  select_best("roc_auc")
best_svm2
# A tibble: 1 × 3
   cost rbf_sigma .config              
  <dbl>     <dbl> <chr>                
1    32     0.001 Preprocessor1_Model70
# Final workflow
final_wf2 <- svm_wf2 %>%
  finalize_workflow(best_svm2)
final_wf2
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: svm_rbf()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_dummy()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Radial Basis Function Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = 32
  rbf_sigma = 0.001

Computational engine: kernlab 
# Fit the whole training set, then predict the test cases
final_fit2 <- 
  final_wf2 %>%
  last_fit(data_split2)
final_fit2
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [300/100]> train/test split <tibble> <tibble> <tibble>     <workflow>
# Test metrics
final_fit2 %>% 
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.84  Preprocessor1_Model1
2 roc_auc  binary         0.948 Preprocessor1_Model1

Comparing these metrics with the tree-based methods from HW4:

AUC: the SVMs fitted here (linear and polynomial kernels: 0.957; RBF kernel: 0.948) exceed the classification tree (0.85), random forest (0.91), and boosting (0.92) from HW4, so the SVMs discriminate better between the High and Low sales classes.

Accuracy: the linear and polynomial SVMs (0.87) match boosting (0.87) and fall slightly below random forest (0.88), while the RBF SVM (0.84) trails all three; every SVM beats the classification tree (0.82).

Overall, the SVMs outperform the HW4 methods on AUC while achieving accuracy comparable to random forest and boosting. Which method is best in practice also depends on the application's requirements, interpretability, and computational cost.
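For ease of comparison, the quoted test metrics can be collected into a single table (values transcribed from this homework and from HW4; this sketch only tabulates them):

```r
library(tibble)
library(dplyr)

# Test-set metrics transcribed from this homework (SVMs) and HW4 (tree-based models)
model_compare <- tribble(
  ~model,                       ~accuracy, ~roc_auc,
  "SVM, linear kernel",          0.87,     0.957,
  "SVM, polynomial kernel",      0.87,     0.957,
  "SVM, RBF kernel",             0.84,     0.948,
  "Boosting (HW4)",              0.87,     0.92,
  "Random forest (HW4)",         0.88,     0.91,
  "Classification tree (HW4)",   0.82,     0.85
)

# Rank by discrimination ability
model_compare %>% arrange(desc(roc_auc))
```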

4 Bonus (10pts)

Let \[ f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p = \beta_0 + \beta^T X. \] Then \(f(X)=0\) defines a hyperplane in \(\mathbb{R}^p\). Show that \(f(x)\) is proportional to the signed distance of a point \(x\) to the hyperplane \(f(X) = 0\).

Answer

Let \(x_0\) be any point on the hyperplane, so \(f(x_0) = 0\), i.e. \(\beta^T x_0 = -\beta_0\). The vector \(\beta\) is normal to the hyperplane: if \(x_1\) and \(x_2\) both lie on it, then \(\beta^T (x_1 - x_2) = (-\beta_0) - (-\beta_0) = 0\), so \(\beta\) is orthogonal to every direction within the hyperplane.

The signed distance \(d(x)\) of a point \(x\) to the hyperplane is the projection of \(x - x_0\) onto the unit normal \(\beta / \|\beta\|\):

\[ d(x) = \frac{\beta^T (x - x_0)}{\|\beta\|}. \]

Substituting \(\beta^T x_0 = -\beta_0\) gives

\[ d(x) = \frac{\beta^T x - \beta^T x_0}{\|\beta\|} = \frac{\beta_0 + \beta^T x}{\|\beta\|} = \frac{f(x)}{\|\beta\|}. \]

Hence \(f(x) = \|\beta\| \, d(x)\): \(f(x)\) is proportional to the signed distance of \(x\) to the hyperplane \(f(X) = 0\), with proportionality constant \(\|\beta\|\). In particular, if \(\|\beta\| = 1\), then \(f(x)\) equals the signed distance exactly.
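The claim that \(f(x)\) is proportional to the signed distance, with constant \(\|\beta\|\), can be sanity-checked numerically. The sketch below uses the hyperplane \(1 + 3X_1 - X_2 = 0\) from Exercise 9.7.1; the test point \(x\) and the on-hyperplane point \(x_0\) are arbitrary illustrative choices:

```r
# Numeric check: f(x) equals ||beta|| times the signed distance d(x).
# Hyperplane 1 + 3*X1 - X2 = 0 (from Exercise 9.7.1).
beta0 <- 1
beta  <- c(3, -1)
x     <- c(2, 5)   # an arbitrary test point
x0    <- c(0, 1)   # lies on the hyperplane: 1 + 3*0 - 1 = 0

f_x <- beta0 + sum(beta * x)                      # f(x) = 1 + 6 - 5 = 2
d_x <- sum(beta * (x - x0)) / sqrt(sum(beta^2))   # projection onto the unit normal

f_x / d_x            # equals ||beta||
sqrt(sum(beta^2))    # ||beta|| = sqrt(10)
```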